[6.18] Track btrfs patches #40
base: base-6.18
The following kernel message may be logged if `add_inline_refs()` or `add_keyed_refs()` block for too long:

> kernel: rcu: INFO: rcu_sched self-detected stall on CPU
> kernel: rcu: 10-....: (2100 ticks this GP) idle=0494/1/0x4000000000000000 softirq=164826140/164826187 fqs=1052
> kernel: rcu: (t=2100 jiffies g=358306033 q=2241752 ncpus=16)
> kernel: CPU: 10 UID: 0 PID: 1524681 Comm: map_0x178e45670 Not tainted 6.12.21-gentoo #1
> kernel: Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> kernel: RIP: 0010:btrfs_get_64+0x65/0x110
> kernel: Code: d3 ed 48 8b 4f 70 48 8b 31 83 e6 40 74 11 0f b6 49 40 41 bc 00 10 00 00 49 d3 e4 49 83 ec 01 4a 8b 5c ed 70 49 21 d4 45 89 c9 <48> 2b 1d 7c 99 09 01 49 01 c1 8b 55 08 49 8d 49 08 44 8b 75 0c 48
> kernel: RSP: 0018:ffffbb7ad531bba0 EFLAGS: 00000202
> kernel: RAX: 0000000000001f15 RBX: fffff437ea382200 RCX: fffff437cb891200
> kernel: RDX: 000001922b68df2a RSI: 0000000000000000 RDI: ffffa434c3e66d20
> kernel: RBP: ffffa434c3e66d20 R08: 000001922b68c000 R09: 0000000000000015
> kernel: R10: 6c0000000000000a R11: 0000000009fe7000 R12: 0000000000000f2a
> kernel: R13: 0000000000000001 R14: ffffa43192e6d230 R15: ffffa43160c4c800
> kernel: FS: 000055d07085e6c0(0000) GS:ffffa4452bc80000(0000) knlGS:0000000000000000
> kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> kernel: CR2: 00007fff204ecfc0 CR3: 0000000121a0b000 CR4: 00000000001506f0
> kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> kernel: Call Trace:
> kernel: <IRQ>
> kernel: ? rcu_dump_cpu_stacks+0xd3/0x100
> kernel: ? rcu_sched_clock_irq+0x4ff/0x920
> kernel: ? update_process_times+0x6c/0xa0
> kernel: ? tick_nohz_handler+0x82/0x110
> kernel: ? tick_do_update_jiffies64+0xd0/0xd0
> kernel: ? __hrtimer_run_queues+0x10b/0x190
> kernel: ? hrtimer_interrupt+0xf1/0x200
> kernel: ? __sysvec_apic_timer_interrupt+0x44/0x50
> kernel: ? sysvec_apic_timer_interrupt+0x60/0x80
> kernel: </IRQ>
> kernel: <TASK>
> kernel: ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> kernel: ? btrfs_get_64+0x65/0x110
> kernel: find_parent_nodes+0x1b84/0x1dc0
> kernel: btrfs_find_all_leafs+0x31/0xd0
> kernel: ? queued_write_lock_slowpath+0x30/0x70
> kernel: iterate_extent_inodes+0x6f/0x370
> kernel: ? update_share_count+0x60/0x60
> kernel: ? extent_from_logical+0x139/0x190
> kernel: ? release_extent_buffer+0x96/0xb0
> kernel: iterate_inodes_from_logical+0xaa/0xd0
> kernel: btrfs_ioctl_logical_to_ino+0xaa/0x150
> kernel: __x64_sys_ioctl+0x84/0xc0
> kernel: do_syscall_64+0x47/0x100
> kernel: entry_SYSCALL_64_after_hwframe+0x4b/0x53
> kernel: RIP: 0033:0x55d07617eaaf
> kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
> kernel: RSP: 002b:000055d07085bc20 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> kernel: RAX: ffffffffffffffda RBX: 000055d0402f8550 RCX: 000055d07617eaaf
> kernel: RDX: 000055d07085bca0 RSI: 00000000c038943b RDI: 0000000000000003
> kernel: RBP: 000055d07085bea0 R08: 00007fee46c84080 R09: 0000000000000000
> kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
> kernel: R13: 000055d07085bf80 R14: 000055d07085bf48 R15: 000055d07085c0b0
> kernel: </TASK>

The RCU stall could be because there's a large number of backrefs for some extents and we're spending too much time looping over them without ever yielding the CPU. Avoid the stall warning by adding `cond_resched()`.

Link: https://lore.kernel.org/linux-btrfs/CAMthOuP_AE9OwiTQCrh7CK73xdTZvHsLTB1JU2WBK6cCc05JYg@mail.gmail.com/T/#md2e3504a1885c63531f8eefc70c94cff571b7a72
Signed-off-by: Kai Krakow <kk@netactive.de>
Signed-off-by: Kai Krakow <kai@kaishome.de>
I tried this patch with CachyOS kernel 6.18.0. After recompiling and rebooting, Is there anything I am missing?
Thanks for the report. I can confirm the same.
Thanks for the report. Will fix...
Add the following flags to give a hint about which chunk should be allocated on which disk. The following flags are created:

- BTRFS_DEV_ALLOCATION_PREFERRED_DATA: preferred data chunk, but metadata chunk allowed
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA: preferred metadata chunk, but data chunk allowed
- BTRFS_DEV_ALLOCATION_METADATA_ONLY: only metadata chunk allowed
- BTRFS_DEV_ALLOCATION_DATA_ONLY: only data chunk allowed

Co-authored-by: Goffredo Baroncelli <kreijack@inwid.it>
Signed-off-by: Kai Krakow <kai@kaishome.de>
Co-authored-by: Goffredo Baroncelli <kreijack@inwind.it> Signed-off-by: Kai Krakow <kai@kaishome.de>
v2: Adds a check to prevent modification while the file system is still mounting.

Todo:
- Transactions should not be triggered from sysfs writes, see: https://lore.kernel.org/linux-btrfs/20251213200920.1808679-1-kai@kaishome.de/

Link: #36 (comment)
Reported-by: Eli Venter <eli@genedx.com>
Co-authored-by: Goffredo Baroncelli <kreijack@inwind.it>
Signed-off-by: Kai Krakow <kai@kaishome.de>
7e81d2c to 8a8411c
@CHN-beta @Forza-tng Thanks for reporting and confirming. This was actually a bug I introduced when I made allocator hints configurable via

Important: This means that whoever has used the 6.18 patches until now never had allocator hints enabled. Please use the new patches, then verify that

If it lists devices with unexpected metadata, take note of the affected device IDs, then run a metadata balance filtered by device ID (separate the IDs with spaces):

`for ID in {SLOW_DEV_IDs}; do btrfs balance start -mdevid=$ID --enqueue {BTRFS-MOUNT-PATH}; done`

E.g., run

Thanks.
Please don't worry about it. Everyone makes mistakes sometimes, and this one didn't actually cause any damage. Thank you for your contribution!
When this mode is enabled, the chunk allocation policy is modified as follows:

Each disk may have a different tag:
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
- BTRFS_DEV_ALLOCATION_DATA_ONLY
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)

Where:
- ALLOCATION_PREFERRED_X means that it is preferred to use this disk for the X chunk type (the other type may be allowed when the space is low)
- ALLOCATION_X_ONLY means that it is used *only* for the X chunk type. This means also that it is a preferred choice.

Each time the allocator allocates a chunk of type X, first it takes the disks tagged as ALLOCATION_X_ONLY or ALLOCATION_PREFERRED_X. If the space is not enough, it uses also the disks tagged as ALLOCATION_METADATA_ONLY. If the space is not enough, it uses also the other disks, with the exception of the one marked as ALLOCATION_PREFERRED_Y, where Y is the other type of chunk (i.e. not X).

Co-authored-by: Goffredo Baroncelli <kreijack@inwind.it>
Signed-off-by: Kai Krakow <kai@kaishome.de>
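To make the tag semantics easier to follow, here is a minimal Python sketch of a tiered device selection. It is not taken from the patch: it follows only the PREFERRED_X and X_ONLY definitions given in the commit message, and it collapses the intermediate steps into a single fallback tier (disks that merely prefer the other chunk type), while disks reserved for the other type are never used. The names `Hint`, `allocation_tiers` and `candidate_devices` are hypothetical; the numeric values match the sysfs types listed in the PR description.

```python
from enum import IntEnum


class Hint(IntEnum):
    # Numeric values as listed for the sysfs "type" file in the PR description.
    PREFERRED_DATA = 0
    PREFERRED_METADATA = 1
    METADATA_ONLY = 2
    DATA_ONLY = 3


def allocation_tiers(chunk_type, hints):
    """Group device names into tiers, most preferred first.

    Simplified reading (an assumption, not the kernel logic):
    tier 0 = X_ONLY or PREFERRED_X, tier 1 = PREFERRED_Y,
    and Y_ONLY devices are excluded entirely.
    """
    if chunk_type == "metadata":
        preferred = {Hint.METADATA_ONLY, Hint.PREFERRED_METADATA}
        fallback = {Hint.PREFERRED_DATA}
    elif chunk_type == "data":
        preferred = {Hint.DATA_ONLY, Hint.PREFERRED_DATA}
        fallback = {Hint.PREFERRED_METADATA}
    else:
        raise ValueError(f"unknown chunk type: {chunk_type!r}")
    return [
        sorted(dev for dev, hint in hints.items() if hint in preferred),
        sorted(dev for dev, hint in hints.items() if hint in fallback),
    ]


def candidate_devices(chunk_type, hints, free, num_needed):
    """Widen the candidate set tier by tier until enough devices have space.

    Returns None when even the fallback tier cannot provide enough devices,
    which models an ENOSPC-like failure.
    """
    chosen = []
    for tier in allocation_tiers(chunk_type, hints):
        chosen += [dev for dev in tier if free[dev] > 0]
        if len(chosen) >= num_needed:
            return chosen
    return None
```

With one SSD tagged PREFERRED_METADATA, one HDD tagged PREFERRED_DATA and one HDD tagged DATA_ONLY, metadata goes to the SSD first, spills over to the first HDD once the SSD is full, and never lands on the DATA_ONLY disk.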
This is useful where you want to prevent new allocations of chunks on a disk which is going to be removed from the pool anyways, e.g. due to bad blocks or because it's slow. Signed-off-by: Kai Krakow <kai@kaishome.de>
This is useful where you want to prevent new allocations of chunks to a set of multiple disks which are going to be removed from the pool. This acts as a multiple `btrfs dev remove` on steroids that can remove multiple disks in parallel without moving data to disks which would be removed in the next round. In such cases, it will avoid moving the same data multiple times, and thus avoid placing it on potentially bad disks. Thanks to @Zygo for the explanation and suggestion. Link: kdave/btrfs-progs#907 (comment) Signed-off-by: Kai Krakow <kai@kaishome.de>
This adds read stats per device to devinfo to evaluate the effects of different read policies better. This adds a new file /sys/fs/btrfs/BTRFS-UUID/devinfo/ID/read_stats. Signed-off-by: Kai Krakow <kai@kaishome.de>
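The PR description lists the counters exposed in `read_stats` (`ios`, `wait`, `avg`, `age`, `ignored`). As a rough illustration of how such counters could relate to each other, the Python sketch below keeps them in a small class. The class name, the method names and the update rules are assumptions derived from the field descriptions, not from the patch, and the actual file format is not shown here.

```python
from dataclasses import dataclass


@dataclass
class ReadStats:
    """Hypothetical per-device bookkeeping mirroring the read_stats fields."""

    ios: int = 0      # total read I/O count
    wait: int = 0     # total accumulated wait time in nanoseconds
    age: int = 0      # consecutive selections in which this device was skipped
    ignored: int = 0  # total number of times the device was a candidate but skipped

    @property
    def avg(self):
        # Cumulative average latency in nanoseconds; 0 before the first read.
        return self.wait // self.ios if self.ios else 0

    def selected(self, latency_ns):
        # For simplicity the latency is recorded together with the selection;
        # a real implementation would account for it when the read completes.
        self.ios += 1
        self.wait += latency_ns
        self.age = 0

    def skipped(self):
        self.age += 1
        self.ignored += 1
```

A device whose `age` keeps growing while `ios` stays flat is one the policy consistently avoids, which is the situation the counter is meant to reveal.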
Read policies seem safe and stable enough to move them out of the experimental feature set. This allows us to add more policies without forcing users to enable the full experimental feature set. Signed-off-by: Kai Krakow <kai@kaishome.de>
Select the preferred stripe based on the mirror with the least in-flight requests. Signed-off-by: Kai Krakow <kai@kaishome.de>
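The selection rule can be illustrated in a few lines of Python. This is only a sketch of the idea stated in the commit message (pick the mirror with the fewest in-flight requests); the function name `pick_mirror` and the tie-breaking rule (lowest index wins) are assumptions and do not reflect the kernel implementation.

```python
def pick_mirror(in_flight):
    """Return the index of the mirror with the fewest in-flight requests.

    Ties are resolved in favor of the lowest index (an assumption made
    for this sketch).
    """
    if not in_flight:
        raise ValueError("at least one mirror is required")
    return min(range(len(in_flight)), key=in_flight.__getitem__)


# Usage: dispatch five reads with no completions in between.
in_flight = [0, 0, 0]
for _ in range(5):
    in_flight[pick_mirror(in_flight)] += 1
```

Because a busy or slow device accumulates outstanding requests, it is picked less often until its queue drains, which is why this rule adapts to mixed-speed pools without measuring latency directly.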
8a8411c to 6137992
Updated the branch to improve the style of some if statements.
Added a GitHub workflow to automatically compile and build-check the patches.
Export patch series: https://github.com/kakra/linux/pull/40.patch
btrfs: tiered allocation hints and queue-based read balancing
Special Thanks to @Forza-tng for extensive testing, feedback, and maintaining the documentation guide:
👉 Btrfs Allocator Hints and Read Policies Guide by Forza-tng
This PR introduces a set of patches to improve Btrfs performance and flexibility in heterogeneous storage environments (mixed SSD/HDD, tiered storage, bcache).
1. Allocator Hints (Data Placement)
Allows preferring specific devices for data or metadata allocations. This works by storing a hint in the persistent device item on-disk.
2. Read Balancing Policies
Extends Btrfs RAID1 read balancing with a dynamic `queue` policy. The standard PID-based policy is often insufficient for mixed-device pools or high-IOPS workloads.
- `pid`: (Default) Static hashing by process ID. Good for simple setups.
- `round-robin`: Distributes reads equally. Good for aggregate throughput on identical disks.
- `queue`: (Recommended) Routes requests to the device with the fewest in-flight requests (shortest queue).
- `devid`: Pin reads to a specific device ID (mostly for testing).
(Note: Previous experimental latency-based policies were dropped in favor of `queue` due to better stability and lower complexity.)
3. Decoupling from Experimental Status
Important: Upstream kernels (6.13+) use `CONFIG_BTRFS_EXPERIMENTAL` to gate various unstable work-in-progress features. To allow using the allocator hints and read policies without enabling potentially unstable upstream code, these features have been moved out of the experimental gate.
Recommendation: Remove the line `CONFIG_BTRFS_EXPERIMENTAL` from your `.config` before running `make oldconfig`. The build system will then prompt you specifically for the new options (`CONFIG_BTRFS_ALLOCATOR_HINTS`, `CONFIG_BTRFS_READ_POLICIES`), allowing you to enable them safely without turning on other experimental Btrfs features.
Quickstart Guide
Setting Allocator Hints
- Enable `CONFIG_BTRFS_ALLOCATOR_HINTS` in kernel config.
- Run `btrfs device usage /mnt/path` to identify your device IDs.
- Set the hint type: `echo <TYPE> | sudo tee /sys/fs/btrfs/<UUID>/devinfo/<DEVID>/type`
Available Types:
- `0`: Prefer data (Default for HDDs).
- `1`: Prefer metadata (Recommended for SSDs/NVMe).
- `2`: Metadata only (Use with caution).
- `3`: Data only (Use with caution).
- `4`: None preferred (Avoids new allocations, useful to drain a drive).
- `5`: None (Strictly prevents ANY new allocation, useful for parallel device remove).
After changing hints, a rebalance of metadata/data is required to move existing extents to their preferred location.
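For scripting, the numeric types and the sysfs path above can be wrapped in a small helper. The following Python sketch is only an illustration: the enum member names and the functions `hint_path` and `set_hint` are hypothetical, while the numeric values and the path layout are the ones listed above. Writing to the real sysfs file requires root and a kernel carrying these patches.

```python
from enum import IntEnum
from pathlib import Path


class AllocHint(IntEnum):
    # Values as listed under "Available Types"; the member names are made up.
    PREFER_DATA = 0
    PREFER_METADATA = 1
    METADATA_ONLY = 2
    DATA_ONLY = 3
    NONE_PREFERRED = 4
    NONE = 5


def hint_path(uuid, devid, sysfs_root="/sys/fs/btrfs"):
    """Build the path of the per-device "type" file."""
    return Path(sysfs_root) / uuid / "devinfo" / str(devid) / "type"


def set_hint(uuid, devid, hint, sysfs_root="/sys/fs/btrfs"):
    """Write the numeric hint, like `echo <TYPE> | sudo tee .../type` does."""
    path = hint_path(uuid, devid, sysfs_root)
    path.write_text(f"{int(AllocHint(hint))}\n")
    return path
```

For example, `set_hint("<UUID>", 1, AllocHint.PREFER_METADATA)` would write `1` to the `type` file of device 1; the `sysfs_root` parameter exists only so the helper can be tried against a scratch directory first.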
Enabling Read Policy
- Enable `CONFIG_BTRFS_READ_POLICIES` in kernel config.
- Set the policy at boot: `btrfs.read_policy=queue`
- Or set it at runtime: `echo queue | sudo tee /sys/fs/btrfs/<UUID>/read_policy`
Diagnostic Statistics
Adds per-device read statistics to `/sys/fs/btrfs/<UUID>/devinfo/<DEVID>/read_stats`.
- `ios`: Total read I/O count.
- `wait`: Total accumulated wait time (ns).
- `avg`: Cumulative average read latency (ns).
- `age`: "Fairness" counter. Increments when the device is skipped/ignored during selection. Resets to 0 when selected. A constantly high age indicates the device is being avoided by the policy.
- `ignored`: Total count of times this device was a candidate but skipped.
Benchmark Results
The following benchmarks (based on kernel 6.12 LTS) compare the new policies against the defaults. Tests were performed on a mixed HDD RAID10 array with bcache, comparing an idle system vs. a system under heavy background load (defrag).
`queue` proved to be the superior all-rounder, effectively isolating foreground workloads from background noise.
Scenario: No Background Load
`pid`, `round-robin`, `latency-rr`*, `queue`
Scenario: Heavy Background Load (Defrag)
`pid`, `round-robin`, `latency-rr`*, `queue`
(`latency-rr` was an experimental hybrid policy used during testing, superseded by `queue` due to better performance and simplicity)
Changes in this version (Kernel 6.18 Port)
`queue` policy.
FAQ: Why is this not upstream?
1. Allocator Hints
The allocator hint patches (originally developed by Goffredo Baroncelli, now maintained here) have been discussed on the mailing list but were not merged for design reasons:
`df`): Btrfs calculates available space assuming any chunk can be allocated on any device (respecting RAID profiles). Restricting allocations via hints makes this calculation unreliable. Tools might report free space while Btrfs returns `ENOSPC` (No space left on device) because the allowed devices for a specific chunk type are full, even if other devices are empty.
Compatibility Note: This patch reuses the existing (unused) `type` field in the device item on disk. It does not change the on-disk format version. Unpatched kernels simply ignore the value, ensuring data remains accessible (though allocation preferences will be lost until booted with a patched kernel).
2. Read Policies (`queue`)
The new `queue` policy is an experimental addition in this patch set. It is unlikely to be accepted upstream in its current form due to Layer Violation: